Search Result

Word-Based Domain Feature-Sensitive Multi-domain Neural Machine Translation

HUANG Zengcheng, MAN Zhibo, ZHANG Yujie, XU Jin’an, CHEN Yufeng

Acta Scientiarum Naturalium Universitatis Pekinensis 2023, 59 (1): 1-10. DOI: 10.13209/j.0479-8023.2022.063

Existing word-based domain feature learning methods still achieve low accuracy on domain discrimination, and further research on domain feature learning is required. To improve domain discrimination and provide accurate translation, this paper proposes a word-based domain feature-sensitive learning mechanism with two components: 1) context feature encoding at the encoder side, which widens the range over which word-based domain features are learned by introducing convolutional neural networks (CNN) into the encoder to extract features from word strings of different lengths in parallel as word context features; and 2) enhanced domain feature learning, in which a domain discriminator module based on multi-layer perceptrons (MLP) is designed to learn more accurate domain proportions from word context features and improve the accuracy of word domain discrimination. Experiments on the English-Chinese task of UM-Corpus and the English-French task of OPUS show that the average BLEU scores of the proposed method exceed the strong baseline by 0.82 and 1.06, respectively, and that the accuracy of domain discrimination improves by 10.07% and 18.06% over the baseline. Further analysis shows that the improvements in average BLEU score and domain discrimination accuracy are attributable to the proposed word-based domain feature-sensitive learning mechanism.
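
The two components named in this abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes one tanh filter per window width (1, 2 and 3 words), zero padding, randomly sampled weights, and a one-hidden-layer MLP with a softmax output; every name and size below is hypothetical.

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols):
    """Random weights stand in for learned parameters (assumption)."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def conv_features(embeddings, kernel):
    """Apply one filter of width len(kernel); zero-pad so every word gets a feature."""
    width, dim = len(kernel), len(embeddings[0])
    left = (width - 1) // 2
    zero = [0.0] * dim
    padded = [zero] * left + embeddings + [zero] * (width - 1 - left)
    out = []
    for i in range(len(embeddings)):
        s = sum(kernel[k][d] * padded[i + k][d]
                for k in range(width) for d in range(dim))
        out.append(math.tanh(s))
    return out

def word_context_features(embeddings, kernels):
    """Concatenate, per word, the outputs of filters with different widths."""
    per_kernel = [conv_features(embeddings, k) for k in kernels]
    return [list(column) for column in zip(*per_kernel)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def mlp_domain_proportions(feature, w1, w2):
    """One ReLU hidden layer, then a softmax over domains."""
    hidden = [max(0.0, sum(w * f for w, f in zip(row, feature))) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return softmax(logits)

EMB_DIM, N_DOMAINS, HIDDEN = 4, 3, 8
sentence = rand_matrix(5, EMB_DIM)                       # 5 word embeddings
kernels = [rand_matrix(w, EMB_DIM) for w in (1, 2, 3)]   # window widths 1, 2, 3
w1 = rand_matrix(HIDDEN, len(kernels))
w2 = rand_matrix(N_DOMAINS, HIDDEN)

features = word_context_features(sentence, kernels)
proportions = [mlp_domain_proportions(f, w1, w2) for f in features]
```

Each entry of `proportions` is a distribution over domains for one word; in a trained system the weights would be learned rather than sampled.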

Incorporating Clause Alignment Knowledge into Chinese-English Neural Machine Translation

MIAO Guoyi, LIU Mingtong, CHEN Yufeng, XU Jin’an, ZHANG Yujie, FENG Wenhe

Acta Scientiarum Naturalium Universitatis Pekinensis 2022, 58 (1): 61-68. DOI: 10.13209/j.0479-8023.2021.111

Current neural machine translation (NMT) does not adequately capture the semantic and structural relationships between clauses in complex sentences, which often results in poor discourse coherence when translating long and complex sentences. To address this problem, the paper proposes a Chinese-English NMT approach that integrates clause alignment knowledge into NMT. First, a labeling scheme combining manual and automatic annotation is introduced to annotate a large-scale clause-aligned Chinese-English parallel corpus, which provides rich clause-level bilingual alignment knowledge for model training. Then, an NMT model based on clause alignment learning is designed to enhance the ability of the model to learn the semantic structure relationships between clauses within complex sentences. Experimental results on the WMT17, WMT18 and WMT19 Chinese-English translation tasks demonstrate that the proposed method significantly improves NMT performance. Evaluation and analysis show that the proposed method effectively improves the discourse coherence of complex sentences in Chinese-English machine translation.

Multi-modality Paraphrase Generation Model Integrating Image Information

MA Chao, WAN Zhang, ZHANG Yujie, XU Jin’an, CHEN Yufeng

Acta Scientiarum Naturalium Universitatis Pekinensis 2022, 58 (1): 45-53. DOI: 10.13209/j.0479-8023.2021.110

In multi-modality scenarios such as commodity descriptions and news comments, existing paraphrase generation models cannot utilize image information, which results in a loss of semantics in the generated paraphrases. To solve this problem, this paper proposes the Multi-modality Paraphrase Generation (MPG) model, which integrates image information into paraphrase generation. To integrate the image information corresponding to the original sentence, MPG first constructs an abstract scene graph and transforms the image features into node features of the scene graph. The constructed scene graph is then used to generate the paraphrase, with a relational graph convolutional neural network as the encoder and a graph-based attention mechanism in the decoder. For evaluation, a sentence pair similarity calculation method is proposed to select sentence pairs describing the same objects from the MSCOCO data set, and evaluation experiments are conducted on the selected pairs. Experimental results show that the proposed MPG model achieves better semantic fidelity, which indicates that integrating image information is effective in improving the quality of paraphrase generation in multi-modality scenarios.

A Review of Entity Linking Research Based on Deep Learning

LI Tianran, LIU Mingtong, ZHANG Yujie, XU Jin’an, CHEN Yufeng

Acta Scientiarum Naturalium Universitatis Pekinensis 2021, 57 (1): 91-98. DOI: 10.13209/j.0479-8023.2020.077

The authors introduce the concept and steps of entity linking in detail, review the problems and research status of deep-learning-based named entity linking in recent years, analyze the problems of entity linking and the corresponding solution models, and present related data sets and evaluation methods. They also summarize the current status of entity linking in international evaluation conferences and analyze future research directions.

Unsupervised Syntactically Controllable Paraphrase Network for Adversarial Example Generation

YANG Erguang, LIU Mingtong, ZHANG Yujie, MENG Yao, HU Changjian, XU Jin’an, CHEN Yufeng

Acta Scientiarum Naturalium Universitatis Pekinensis 2021, 57 (1): 83-90. DOI: 10.13209/j.0479-8023.2020.079

Prior work on adversarial example generation with syntactically controlled paraphrase networks requires large-scale paraphrase parallel corpora to train models, so model performance is severely limited by the domain and scale of the paraphrase parallel corpus. To solve this problem, this paper proposes an unsupervised syntactically controlled paraphrase model for generating adversarial examples that needs only monolingual data. Specifically, a variational autoencoder is used to learn the model, which maps a sentence and a syntactic parse tree into semantic and syntactic variables, respectively. By learning to reconstruct the input sentence from the syntactic and semantic variables, the model effectively learns to generate syntactic paraphrases without using any parallel data. Experimental results on unsupervised sentence paraphrasing and adversarial example generation demonstrate that the proposed model achieves new state-of-the-art results on unsupervised paraphrase generation and generates effective adversarial examples. These examples can be used to improve the robustness and generalization of natural language processing (NLP) models.

Object Space Relation Mechanism Fused Image Caption Method

WAN Zhang, ZHANG Yujie, LIU Mingtong, XU Jin’an, CHEN Yufeng

Acta Scientiarum Naturalium Universitatis Pekinensis 2021, 57 (1): 75-82. DOI: 10.13209/j.0479-8023.2020.080

Focusing on the positional relationships between objects in an image, this paper proposes a neural image caption generation model that integrates a spatial relationship mechanism, in order to provide key information (object position or trajectory) for downstream tasks such as visual question answering and voice navigation. To enhance the ability of the image encoder to learn positional relationships between objects, a geometric attention mechanism is introduced by improving the Transformer structure, so that the positional relationships between objects are explicitly integrated into the appearance information of the objects. To support extraction and caption generation oriented to this specific information, a data production method for relative position relations is further proposed, and the image caption data set Re-Position, which describes position relations between objects, is built on the SpatialSense data set. Comparative evaluation against five typical models shows that the proposed model outperforms the other models on five metrics on the public COCO test set and on all six metrics on the Re-Position data set.

Research on the Construction and Application of Paraphrase Parallel Corpus

WANG Yasong, LIU Mingtong, ZHANG Yujie, XU Jin’an, CHEN Yufeng

Acta Scientiarum Naturalium Universitatis Pekinensis 2021, 57 (1): 68-74. DOI: 10.13209/j.0479-8023.2020.078

Taking Chinese as the research object, the authors propose a method for constructing large-scale, high-quality paraphrase parallel corpora. The paraphrase data augmentation method includes transferring an English paraphrase corpus to Chinese by means of translation engines and manually annotating an evaluation data set. Based on the constructed Chinese paraphrase data, the validity of the construction and application method is verified on the paraphrase recognition task and the natural language inference task. First, paraphrase recognition data is generated from the constructed paraphrase corpus, and an attention-based neural sentence matching model is pre-trained on it to capture paraphrase information. Then, the pre-trained model is applied to the natural language inference task to improve performance. Experimental results on the open set show that the constructed paraphrase corpus can be effectively applied to the paraphrase recognition task and that the model can learn paraphrase knowledge. When applied to the natural language inference task, paraphrase knowledge effectively improves the accuracy of natural language inference models, which verifies the effectiveness of paraphrase knowledge for downstream semantic understanding tasks. Moreover, the proposed corpus construction method is language-independent, so it can provide more training data for other languages and fields, generate high-quality paraphrase data, and further improve the performance of other tasks.

A Multi-Mechanism Fused Paraphrase Generation Model with Joint Auto-Encoding Learning

LIU Mingtong, ZHANG Yujie, ZHANG Shu, MENG Yao, XU Jin’an, CHEN Yufeng

Acta Scientiarum Naturalium Universitatis Pekinensis 2020, 56 (1): 53-60. DOI: 10.13209/j.0479-8023.2019.104

The neural encoder-decoder framework has become a popular method for paraphrase generation, but two problems remain. First, the generated paraphrase sentences suffer from inaccurate entity words, unknown words and word repetition. To solve this problem, we propose a multi-mechanism fused paraphrase generation model that improves the decoder: a copy mechanism copies words from the input sentence to improve the generation of entity and unknown words, and a coverage mechanism models historical attention information to avoid word repetition. Second, the limited scale of parallel paraphrase corpora limits the learning ability of the encoder. We propose to jointly learn an auto-encoding task that shares one encoder with the paraphrase generation task, which enhances the learning ability of the encoder. Experimental results on the Quora paraphrase dataset show that the multi-mechanism fused paraphrase generation model with the joint auto-encoding task effectively improves the performance of paraphrase generation.
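
The copy and coverage mechanisms mentioned in this abstract can be sketched as follows. The abstract does not give the formulas, so this sketch assumes a pointer-generator style formulation (a mixing weight `p_gen`, and a penalty that sums the element-wise minimum of the current attention and the accumulated attention); the function names and toy numbers are hypothetical.

```python
def copy_distribution(p_vocab, attention, source_tokens, p_gen):
    """Mix generation and copy probabilities into one distribution over words.

    Assumed form: p_gen * P_vocab(w) + (1 - p_gen) * attention mass on
    source positions holding w. Out-of-vocabulary source words (for example
    entity names) receive probability only through the copy term.
    """
    final = {word: p_gen * p for word, p in p_vocab.items()}
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final

def coverage_penalty(attention_steps):
    """Penalize attending again to positions that were already attended.

    Coverage is the running sum of past attention vectors; the penalty at
    each step is sum(min(attention, coverage)).
    """
    coverage = [0.0] * len(attention_steps[0])
    penalty = 0.0
    for attention in attention_steps:
        penalty += sum(min(a, c) for a, c in zip(attention, coverage))
        coverage = [c + a for c, a in zip(coverage, attention)]
    return penalty

source = ["where", "is", "Zanzibar"]
p_vocab = {"where": 0.5, "is": 0.3, "<unk>": 0.2}   # "Zanzibar" is out of vocabulary
final = copy_distribution(p_vocab, [0.1, 0.1, 0.8], source, p_gen=0.6)

repeated = coverage_penalty([[1.0, 0.0], [1.0, 0.0]])   # same position twice
spread = coverage_penalty([[1.0, 0.0], [0.0, 1.0]])     # different positions
```

In this toy case the out-of-vocabulary word "Zanzibar" gets its probability purely from copying, and attending to the same source position twice is penalized while spreading attention is not.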

A Neural Paraphrase Identification Model Based on Syntactic Structure

LIU Mingtong, ZHANG Yujie, XU Jin’an, CHEN Yufeng

Acta Scientiarum Naturalium Universitatis Pekinensis 2020, 56 (1): 45-52. DOI: 10.13209/j.0479-8023.2019.092

Paraphrase identification involves natural language semantic understanding. Most previous methods regard sentences as sequential structures and use sequential neural networks for semantic composition, without considering the influence of syntactic structure on semantic computation. In this paper, we propose a neural paraphrase identification model based on syntactic structure and design a tree-based neural network for semantic composition, which extends the semantic representation from the word level to the phrase level. Furthermore, we propose a syntactic tree alignment mechanism based on phrase-level semantic representation and extract features with a cross-sentence attention mechanism. Finally, a self-attention mechanism is used to enhance the semantic representation, which effectively models context information based on syntactic structure. Experiments on the Quora paraphrase dataset show that the model reaches 89.3% accuracy on paraphrase identification. The results further show that the proposed syntax-based semantic composition method, phrase-level cross-sentence attention and self-attention are all effective in improving paraphrase identification.

Distant Supervision for Relation Extraction with Gate Mechanism

LI Xingya, CHEN Yufeng, XU Jin’an, ZHANG Yujie

Acta Scientiarum Naturalium Universitatis Pekinensis 2020, 56 (1): 39-44. DOI: 10.13209/j.0479-8023.2019.101

A piecewise convolutional neural network with a gating mechanism is proposed, which automatically filters positively correlated features at the word level. Moreover, the idea of soft labels is introduced into the gating mechanism to weaken the impact of hard labels on noise filtering. Combined with sentence-level noise filtering, this improves the overall performance of the model. Experimental results on the public dataset show that the proposed model significantly outperforms sentence-level noise filtering methods.
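
As a rough illustration of the two ingredients named in this abstract, the sketch below combines a word-level gate with piecewise max pooling. The abstract does not specify the gate, so a plain sigmoid scaling of each word's feature vector is assumed here; the piecewise pooling follows the common convention of splitting the sentence into three segments at the two entity positions. All names and numbers are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_words(features, gate_scores):
    """Scale each word's feature vector by a gate value in (0, 1) (assumed gate form)."""
    return [[sigmoid(score) * value for value in vector]
            for vector, score in zip(features, gate_scores)]

def piecewise_max_pool(features, e1, e2):
    """Max-pool every feature dimension separately over three segments.

    The segments are cut at the entity positions e1 < e2:
    words up to e1, words after e1 up to e2, and words after e2.
    An empty segment contributes zeros.
    """
    dims = len(features[0])
    segments = [features[:e1 + 1], features[e1 + 1:e2 + 1], features[e2 + 1:]]
    pooled = []
    for segment in segments:
        for d in range(dims):
            pooled.append(max(vector[d] for vector in segment) if segment else 0.0)
    return pooled

# Five words with two convolution features each; entities at positions 1 and 3.
conv_out = [[1.0, 0.0], [3.0, 1.0], [0.0, 5.0], [2.0, 2.0], [4.0, 0.0]]
pooled = piecewise_max_pool(conv_out, e1=1, e2=3)

# A gate score of 0 halves the features, large positive scores keep them.
gated = gate_words([[2.0, 4.0]], [0.0])
```

Pooling per segment keeps some positional structure around the two entities, which a single max over the whole sentence would discard.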

Integrating of Grapheme-Based and Phoneme-Based Transliteration Unit Alignment Method

LIU Bojia, XU Jin’an, CHEN Yufeng, ZHANG Yujie

Acta Scientiarum Naturalium Universitatis Pekinensis 2016, 52 (1): 75-80. DOI: 10.13209/j.0479-8023.2016.001

To reduce the errors caused by using only the phoneme-based method or only the grapheme-based method, this paper combines statistics and rules to propose a new transliteration unit alignment method that integrates the two main transliteration methods. Four experiments are designed to compare it with the traditional methods. Experimental results show that the proposed method outperforms the other methods in machine transliteration.

Improved Statistical Machine Translation with Source Language Paraphrase

SU Chen, ZHANG Yujie, GUO Zhen, XU Jin’an

Acta Scientiarum Naturalium Universitatis Pekinensis

The performance of statistical machine translation (SMT) suffers from insufficient parallel corpora. To solve this problem, the authors propose a paraphrase-based SMT framework with three components: 1) acquiring paraphrase knowledge through a third language; 2) expressing multiple paraphrases of the input sentence in a lattice and modifying the decoder so that it can process the lattice; 3) integrating paraphrase knowledge as features into the log-linear model. In this way, more expressions in the source language can be covered, and more expressions in the target language can be generated as candidate translations. To verify the proposed method, experiments are conducted on three training data sets of different sizes to evaluate how much paraphrasing improves the performance of the SMT system. Experimental results show that translation performance improves significantly (BLEU +1.4%) when the parallel corpus is small (10 K), and that a gain (BLEU +0.32%) is still achieved when the parallel corpus is large (1 M).
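
The idea of adding paraphrase knowledge as a feature in a log-linear model can be sketched in a few lines. A log-linear model scores a candidate as a weighted sum of feature values; the feature names, weights and values below are invented for illustration and are not taken from the paper.

```python
def log_linear_score(features, weights):
    """Score = sum over features of weight * feature value (log-domain features)."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical weights: translation model, language model, paraphrase feature.
weights = {"tm": 1.0, "lm": 0.5, "paraphrase": 0.3}

# Hypothetical candidates. A paraphrase value of 0.0 means the candidate comes
# from the original input path (log probability of 1); a negative value means
# it was reached through a paraphrased path in the lattice.
candidates = {
    "translation A": {"tm": -2.0, "lm": -4.0, "paraphrase": 0.0},
    "translation B": {"tm": -2.5, "lm": -2.0, "paraphrase": -1.0},
}

scores = {name: log_linear_score(f, weights) for name, f in candidates.items()}
best = max(scores, key=scores.get)
```

Here the candidate reached through a paraphrase wins despite its paraphrase penalty, because its language model score is much better; in a real system the weights would be tuned on held-out data.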

Summarization Based on Hidden Topic Markov Model with Multi-features

LIU Jiangming, XU Jin’an, ZHANG Yujie

Acta Scientiarum Naturalium Universitatis Pekinensis

Based on the hidden topic Markov model (HTMM), the authors remove the limiting assumption of LDA (latent Dirichlet allocation) in order to exploit structural information when generating a summary, and use multiple features based on document content to improve summary quality. Furthermore, a method is proposed for extending single-document summarization to multi-document summarization without breaking document structure, toward a more complete automatic summarization system. Experimental results on the standard DUC2007 dataset show the advantage of HTMM and multiple features: compared with LDA, ROUGE values improve when HTMM is used with multiple features.
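
For readers unfamiliar with the ROUGE values reported here, the following is a minimal sketch of ROUGE-N recall on whitespace-tokenized text. It is a simplified single-reference version for illustration only, not the official evaluation toolkit, and the example sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the number of n-grams in the reference."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
rouge_1 = rouge_n_recall(candidate, reference, n=1)   # 5 of 6 unigrams match
rouge_2 = rouge_n_recall(candidate, reference, n=2)   # 3 of 5 bigrams match
```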

Japanese Time Expression Recognition and Translation

ZHAO Ziyu, XU Jin’an, ZHANG Yujie, LIU Jiangming

Acta Scientiarum Naturalium Universitatis Pekinensis

Based on a defined knowledge base, the authors present a Japanese time expression recognition method that combines a rule set strengthened by the knowledge base with a statistical model. To increase recognition accuracy, the knowledge base is progressively expanded and reconstructed according to the granular classification of time in the TIMEX2 standard and the characteristics of Japanese time expressions, so that the rule set is optimized and updated. In addition, a CRF model is incorporated to enhance the generalization ability of Japanese time expression recognition. The authors study the time translation accuracy of a phrase-based translation model and show the necessity of combining rules with statistical machine translation (SMT). Experimental results show that the F1 value of Japanese time expression recognition reaches 0.8987 on the open test, and that both the precision and recall of the method based on rules and a Japanese-Chinese parallel dictionary of time expressions are slightly higher than those of the method based on the statistical translation model.

Chinese Word Segmentation for Patent Documents

YUE Jinyuan, XU Jin’an, ZHANG Yujie

Acta Scientiarum Naturalium Universitatis Pekinensis

According to the characteristics of patent documents, the authors present a statistical approach to Chinese word segmentation based on domain dictionaries. The NC-value algorithm and a conditional random field (CRF) model are adopted for domain term extraction to address unknown word recognition. Experimental results show that the proposed method improves the efficiency of word segmentation and the identification of unknown words. On an open test, the precision is 95.56%, the recall is 96.18%, and the F-measure is 95.87%.
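
The reported F-measure is consistent with the usual harmonic mean of precision and recall, as this short check shows. The balanced (F1) form is assumed here, since the abstract does not state which weighting was used.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the abstract, in percent.
f = f_measure(95.56, 96.18)
print(round(f, 2))
```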
